DACSS 601: Data Science Fundamentals - FALL 2022

HW2

Author: Lai Wei

Published: November 17, 2022

In this homework, I use mmr_2015.csv, a data set containing a subset of the (real) data used to generate the United Nations maternal mortality estimates published in 2015.

Code
library(tidyverse)
library(babynames)
Error in library(babynames): there is no package called 'babynames'
Code
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)

Read Data

Background for mmr_2015.csv: The maternal mortality ratio (MMR) is defined as the number of maternal deaths per 100,000 live births. The UN maternal mortality estimation group produces estimates of the MMR for all countries in the world.

Code
mmr <- read.csv("_data/mmr_2015.csv")
mmr
    iso                    country year      mmr
1   AFG                Afghanistan 2000 1900.000
2   DZA                    Algeria 1999  117.410
3   BGD                 Bangladesh 2009  194.000
4   BGD                 Bangladesh 2000  322.156
5   BWA                   Botswana 2006  139.790
6   BWA                   Botswana 2012  147.900
7   BWA                   Botswana 2005  157.700
8   BWA                   Botswana 2010  163.000
9   BWA                   Botswana 2013  182.600
10  BWA                   Botswana 2007  183.470
11  BWA                   Botswana 2011  188.860
12  BWA                   Botswana 2009  189.570
13  BWA                   Botswana 2008  195.730
14  CMR                   Cameroon 2010  652.000
15  CHN                      China 2012   24.500
16  CHN                      China 2011   26.100
17  CHN                      China 2010   30.000
18  CHN                      China 2009   31.900
19  CHN                      China 2008   34.200
20  CHN                      China 2007   36.600
21  CHN                      China 2006   41.100
22  CHN                      China 2002   43.200
23  CHN                      China 2005   47.700
24  CHN                      China 2004   48.300
25  CHN                      China 2001   50.200
26  CHN                      China 2003   51.300
27  CHN                      China 2000   53.000
28  CHN                      China 1999   58.700
29  CHN                      China 1995   61.900
30  CHN                      China 1997   63.600
31  CHN                      China 1990   88.900
32  EGY                      Egypt 2006   59.000
33  EGY                      Egypt 2004   68.000
34  EGY                      Egypt 2000   84.000
35  EGY                      Egypt 1992  174.000
36  SLV                El Salvador 2008   51.090
37  SLV                El Salvador 2007   56.930
38  HND                   Honduras 2012   74.100
39  HND                   Honduras 2013   86.000
40  IND                      India 2012  167.000
41  IND                      India 2011  178.000
42  IND                      India 2008  212.000
43  IND                      India 2005  254.000
44  IND                      India 2002  301.000
45  IND                      India 2000  327.000
46  IND                      India 1998  398.000
47  IND                      India 1992  437.000
48  IND                      India 1999  540.000
49  IRN Iran (Islamic Republic of) 2013   19.700
50  IRN Iran (Islamic Republic of) 2012   19.900
51  IRN Iran (Islamic Republic of) 2008   20.900
52  IRN Iran (Islamic Republic of) 2011   21.500
53  IRN Iran (Islamic Republic of) 2006   21.700
54  IRN Iran (Islamic Republic of) 2010   22.100
55  IRN Iran (Islamic Republic of) 2005   23.800
56  IRN Iran (Islamic Republic of) 2004   24.100
57  IRN Iran (Islamic Republic of) 2007   24.700
58  IRN Iran (Islamic Republic of) 2009   25.400
59  IRN Iran (Islamic Republic of) 2002   27.400
60  IRN Iran (Islamic Republic of) 2003   28.300
61  IRQ                       Iraq 2012   35.000
62  JAM                    Jamaica 2009   73.200
63  JAM                    Jamaica 1999   73.500
64  JAM                    Jamaica 1995   81.300
65  JAM                    Jamaica 2012   81.300
66  JAM                    Jamaica 2004   82.500
67  JAM                    Jamaica 2003   89.800
68  JAM                    Jamaica 1998   90.100
69  JAM                    Jamaica 2000   90.300
70  JAM                    Jamaica 2001   91.500
71  JAM                    Jamaica 2007   92.900
72  JAM                    Jamaica 2011   95.700
73  JAM                    Jamaica 2006   96.700
74  JAM                    Jamaica 2008  102.000
75  JAM                    Jamaica 2005  109.200
76  JAM                    Jamaica 2002  110.500
77  JAM                    Jamaica 2010  113.300
78  MNG                   Mongolia 2014   30.200
79  MNG                   Mongolia 2013   42.600
80  MNG                   Mongolia 2010   47.400
81  MNG                   Mongolia 2008   48.600
82  MNG                   Mongolia 2011   48.700
83  MNG                   Mongolia 2012   51.500
84  MNG                   Mongolia 2006   67.200
85  MNG                   Mongolia 2009   81.000
86  MNG                   Mongolia 2007   88.300
87  MNG                   Mongolia 2005   92.700
88  MNG                   Mongolia 2004   96.700
89  MNG                   Mongolia 2003  107.200
90  MNG                   Mongolia 2002  121.500
91  MNG                   Mongolia 1997  143.500
92  MNG                   Mongolia 1998  162.400
93  MNG                   Mongolia 2001  165.000
94  MNG                   Mongolia 2000  166.200
95  MNG                   Mongolia 1996  173.700
96  MNG                   Mongolia 1999  182.000
97  MNG                   Mongolia 1995  186.000
98  MNG                   Mongolia 1992  203.900
99  MNG                   Mongolia 1994  219.000
100 MNG                   Mongolia 1993  259.000
101 MMR                    Myanmar 2004  315.860
102 NPL                      Nepal 2008  229.000
103 OMN                       Oman 1993    6.000
104 OMN                       Oman 1997   13.300
105 OMN                       Oman 1999   13.700
106 OMN                       Oman 2005   15.400
107 OMN                       Oman 1992   15.900
108 OMN                       Oman 2000   16.100
109 OMN                       Oman 1998   18.500
110 OMN                       Oman 2004   18.500
111 OMN                       Oman 1996   21.000
112 OMN                       Oman 1995   22.000
113 OMN                       Oman 2001   23.100
114 OMN                       Oman 2003   23.200
115 OMN                       Oman 1994   24.400
116 OMN                       Oman 1991   27.400
117 OMN                       Oman 2002   37.500
118 PRY                   Paraguay 2001  178.000
119 PER                       Peru 2011   92.700
120 PER                       Peru 2010   95.900
121 PER                       Peru 2009   96.100
122 PER                       Peru 2008  107.900
123 PER                       Peru 2007  110.500
124 PER                       Peru 2005  114.100
125 PER                       Peru 2006  114.900
126 PER                       Peru 2002  118.300
127 PER                       Peru 2004  120.800
128 PER                       Peru 2003  123.800
129 SAU               Saudi Arabia 1997   23.000
130 SSD                South Sudan 2005 2037.000
131 LKA                  Sri Lanka 2010   31.100
132 LKA                  Sri Lanka 2011   32.500
133 LKA                  Sri Lanka 2008   33.400
134 LKA                  Sri Lanka 2012   37.700
135 LKA                  Sri Lanka 2004   38.000
136 LKA                  Sri Lanka 2007   38.400
137 LKA                  Sri Lanka 2006   39.300
138 LKA                  Sri Lanka 2009   40.200
139 LKA                  Sri Lanka 2003   42.400
140 LKA                  Sri Lanka 2005   44.000
141 LKA                  Sri Lanka 2001   46.600
142 LKA                  Sri Lanka 1998   53.000
143 LKA                  Sri Lanka 2002   53.400
144 LKA                  Sri Lanka 2000   55.600
145 LKA                  Sri Lanka 1999   55.800
146 LKA                  Sri Lanka 1995   61.000
147 LKA                  Sri Lanka 1996   62.000
148 LKA                  Sri Lanka 1997   63.000
149 SDN                      Sudan 2009  215.600
150 SDN                      Sudan 2005  638.000
151 SYR       Syrian Arab Republic 2008   56.000
152 THA                   Thailand 1998   36.400
153 THA                   Thailand 1997   36.500
154 THA                   Thailand 2005   37.400
155 THA                   Thailand 2006   41.600
156 THA                   Thailand 1996   44.100
157 THA                   Thailand 2004   44.500
158 TUN                    Tunisia 2008   44.800
159 TUN                    Tunisia 1993   68.900
160 TUR                     Turkey 2014   15.200
161 TUR                     Turkey 2012   15.400
162 TUR                     Turkey 2011   15.500
163 TUR                     Turkey 2013   15.900
164 TUR                     Turkey 2010   16.400
165 TUR                     Turkey 2009   18.400
166 TUR                     Turkey 2008   19.400
167 TUR                     Turkey 2007   21.200
168 ARE       United Arab Emirates 2005    0.000
169 ARE       United Arab Emirates 2000    0.000
170 ARE       United Arab Emirates 2008    1.400
171 ARE       United Arab Emirates 1995   22.400
172 ARE       United Arab Emirates 1990   32.600
173 VNM                   Viet Nam 2009   69.000
174 VNM                   Viet Nam 2001  130.000
175 YEM                      Yemen 2012  148.160
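
Printing all 175 rows works, but a compact summary is often easier to read. The following is a minimal sketch of that idea, assuming dplyr is available; `mmr_head` is a hypothetical name for the first six rows of the output above, typed in by hand so the example runs without the CSV file.

```r
library(dplyr)

# First six rows of the printed data, typed in by hand for illustration
mmr_head <- data.frame(
  iso     = c("AFG", "DZA", "BGD", "BGD", "BWA", "BWA"),
  country = c("Afghanistan", "Algeria", "Bangladesh", "Bangladesh",
              "Botswana", "Botswana"),
  year    = c(2000L, 1999L, 2009L, 2000L, 2006L, 2012L),
  mmr     = c(1900, 117.41, 194, 322.156, 139.79, 147.9)
)

glimpse(mmr_head)             # column names, types, and a short preview
n_distinct(mmr_head$country)  # number of distinct countries in the sample
range(mmr_head$year)          # earliest and latest observation year
```

The same three calls could be applied to the full `mmr` data frame once it has been read in.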

Variables in the data set mmr_2015.csv are as follows:

  • iso = three-letter ISO country code
  • country = country name
  • year = observation year
  • mmr = observed maternal mortality ratio, defined as (number of maternal deaths / number of live births) × 100,000
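
As a worked illustration of the ratio defined above, here is a small sketch. The function name `mmr_ratio` and the counts passed to it are hypothetical and are not taken from the data set.

```r
# Maternal mortality ratio: maternal deaths per 100,000 live births
mmr_ratio <- function(maternal_deaths, live_births) {
  maternal_deaths / live_births * 100000
}

# Hypothetical counts: 50 maternal deaths among 250,000 live births
mmr_ratio(maternal_deaths = 50, live_births = 250000)
```

With these made-up counts the ratio comes out at 20 maternal deaths per 100,000 live births, which is on the same scale as the `mmr` column.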

Data Visualization

Construct a graph that shows the observed values of the MMR plotted against year (starting in 2000) for China and Viet Nam. Use the pipe operator so that the graph follows from a multi-line command that starts with “mmr %>%”. Use ggplot() to display the data.

Code
# Keep China and Viet Nam from 2000 onward, then plot MMR against year
mmr %>%
  filter(country %in% c("China", "Viet Nam"), year >= 2000) %>%
  ggplot(aes(x = year, y = mmr, color = country)) +
  geom_point()
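
Before plotting, it can help to check how many observations survive the filter. The sketch below assumes dplyr is available; `mmr_cv` and `kept` are hypothetical names, and the rows are the China and Viet Nam observations copied by hand from the printed data above.

```r
library(dplyr)

# China and Viet Nam rows copied from the printed data
mmr_cv <- data.frame(
  country = c(rep("China", 17), rep("Viet Nam", 2)),
  year = c(2012, 2011, 2010, 2009, 2008, 2007, 2006, 2002, 2005, 2004,
           2001, 2003, 2000, 1999, 1995, 1997, 1990,
           2009, 2001),
  mmr = c(24.5, 26.1, 30.0, 31.9, 34.2, 36.6, 41.1, 43.2, 47.7, 48.3,
          50.2, 51.3, 53.0, 58.7, 61.9, 63.6, 88.9,
          69.0, 130.0)
)

# Same filter as the plot: two countries, years from 2000 onward
kept <- mmr_cv %>%
  filter(country %in% c("China", "Viet Nam"), year >= 2000)

# Observations per country that reach the plot
count(kept, country)
```

China contributes 13 points and Viet Nam only 2, so the Viet Nam series in the plot is very sparse.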

Babynames

The babynames package contains the names of male and female babies born in the US from 1880 to 2017. Here, babynames is filtered to keep only those rows with year > 1975, sex equal to male, and either prop > 0.025 or n > 50000.

Code
babynames %>% 
  filter(year > 1975, sex == "M",prop > 0.025|n > 50000) %>% 
  ggplot(aes(x = year, y = prop))+
  geom_point(aes(group = name,color = name), size = 2)+
  geom_line(aes(group = name, color = name))+
  expand_limits(y = 0)
Error in filter(., year > 1975, sex == "M", prop > 0.025 | n > 50000): object 'babynames' not found
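
The error above follows from the earlier `library(babynames)` failure: the package was not installed where the page was rendered. One possible guard, sketched here with the hypothetical variable name `has_babynames`, is to check for the package before loading it.

```r
# Load babynames only if it is installed; otherwise say how to get it
has_babynames <- requireNamespace("babynames", quietly = TRUE)

if (has_babynames) {
  library(babynames)
} else {
  message('Package "babynames" is not installed; ',
          'run install.packages("babynames") and render again.')
}
```

With this check in the setup chunk, a missing package produces a readable message, and the plotting code can be wrapped in `if (has_babynames) { ... }` so the rest of the document still renders.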

Tidy Table

Construct and print a tibble that shows the countries sorted by their average observed MMR (rounded to zero digits), with the country with the highest average MMR listed first.

Code
data1 <- mmr %>%
  group_by(country) %>%
  summarise(ave = round(mean(mmr), 0)) %>%  # average MMR per country, rounded
  arrange(desc(ave))                        # highest average first
data1
# A tibble: 30 × 2
   country       ave
   <chr>       <dbl>
 1 South Sudan  2037
 2 Afghanistan  1900
 3 Cameroon      652
 4 Sudan         427
 5 Myanmar       316
 6 India         313
 7 Bangladesh    258
 8 Nepal         229
 9 Paraguay      178
10 Botswana      172
# … with 20 more rows

Continuing with the mmr data set

Part a: For each year

  • first calculate the mean observed value for each country
  • then rank countries by increasing MMR for each year.

Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table.

Code
data2<-
  mmr %>% 
  group_by(year) %>% 
  mutate(Mean = mean(mmr,na.rm = TRUE)) %>% 
  arrange(desc(mmr))
data2
# A tibble: 175 × 5
# Groups:   year [25]
   iso   country      year   mmr  Mean
   <chr> <chr>       <int> <dbl> <dbl>
 1 SSD   South Sudan  2005 2037  275. 
 2 AFG   Afghanistan  2000 1900  301. 
 3 CMR   Cameroon     2010  652  130. 
 4 SDN   Sudan        2005  638  275. 
 5 IND   India        1999  540  149. 
 6 IND   India        1992  437  208. 
 7 IND   India        1998  398  126. 
 8 IND   India        2000  327  301. 
 9 BGD   Bangladesh   2000  322. 301. 
10 MMR   Myanmar      2004  316.  85.7
# … with 165 more rows
Code
arrange(data2, desc(Mean))
# A tibble: 175 × 5
# Groups:   year [25]
   iso   country               year    mmr  Mean
   <chr> <chr>                <int>  <dbl> <dbl>
 1 AFG   Afghanistan           2000 1900    301.
 2 IND   India                 2000  327    301.
 3 BGD   Bangladesh            2000  322.   301.
 4 MNG   Mongolia              2000  166.   301.
 5 JAM   Jamaica               2000   90.3  301.
 6 EGY   Egypt                 2000   84    301.
 7 LKA   Sri Lanka             2000   55.6  301.
 8 CHN   China                 2000   53    301.
 9 OMN   Oman                  2000   16.1  301.
10 ARE   United Arab Emirates  2000    0    301.
# … with 165 more rows
Code
lowest10 <- print(tail(data2,10))
# A tibble: 10 × 5
# Groups:   year [9]
   iso   country               year   mmr  Mean
   <chr> <chr>                <int> <dbl> <dbl>
 1 TUR   Turkey                2011  15.5  77.7
 2 OMN   Oman                  2005  15.4 275. 
 3 TUR   Turkey                2012  15.4  73.0
 4 TUR   Turkey                2014  15.2  22.7
 5 OMN   Oman                  1999  13.7 149. 
 6 OMN   Oman                  1997  13.3  57.2
 7 OMN   Oman                  1993   6   111. 
 8 ARE   United Arab Emirates  2008   1.4  82.6
 9 ARE   United Arab Emirates  2005   0   275. 
10 ARE   United Arab Emirates  2000   0   301. 
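
The chunks above attach each year's mean MMR (taken across countries) to every row and sort by MMR; they do not compute ranks. A minimal sketch of the ranking steps described in Part a follows, assuming dplyr 1.0.0 or later. `mmr_small` is a hypothetical 12-row subset copied from the printed data, and "lowest ranking" is read here as the smallest mean rank, which is an interpretation of the task rather than something the source states.

```r
library(dplyr)

# Subset of the printed data: four countries observed in 2000, 2002, and 2005
mmr_small <- data.frame(
  country = rep(c("Oman", "China", "Jamaica", "Mongolia"), each = 3),
  year    = rep(c(2000, 2002, 2005), times = 4),
  mmr     = c(16.1, 37.5, 15.4,
              53.0, 43.2, 47.7,
              90.3, 110.5, 109.2,
              166.2, 121.5, 92.7)
)

mean_rank_by_year <- mmr_small %>%
  group_by(year, country) %>%
  summarise(mean_mmr = mean(mmr, na.rm = TRUE), .groups = "drop") %>%  # mean per country and year
  group_by(year) %>%
  mutate(rank = min_rank(mean_mmr)) %>%   # rank 1 = lowest MMR within that year
  group_by(country) %>%
  summarise(mean_rank = mean(rank), .groups = "drop") %>%  # average rank across years
  slice_min(mean_rank, n = 10)            # up to 10 countries with the smallest mean rank

mean_rank_by_year
```

On this subset Oman ranks first in every year, while Jamaica and Mongolia swap places in 2005, so their mean ranks are 10/3 and 11/3. Running the same pipeline on the full `mmr` data frame would give the table Part a asks for.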

Part b: With rankings calculated separately for two periods, with period 1 referring to years < 2000 and period 2 referring to years >= 2000.

For each period

  • first calculate the mean observed value for each country
  • then rank countries by increasing MMR for each period.

Calculate the mean ranking across all periods, extract the 10 countries with the lowest ranking across all periods, and print the table.

Code
before_2000<-mmr %>% 
  filter(year < 2000) %>% 
  group_by(country) %>% 
  mutate(Mean = mean(mmr,na.rm = TRUE)) %>% 
  arrange(desc(mmr))
before_2000
# A tibble: 41 × 5
# Groups:   country [12]
   iso   country   year   mmr  Mean
   <chr> <chr>    <int> <dbl> <dbl>
 1 IND   India     1999  540   458.
 2 IND   India     1992  437   458.
 3 IND   India     1998  398   458.
 4 MNG   Mongolia  1993  259   191.
 5 MNG   Mongolia  1994  219   191.
 6 MNG   Mongolia  1992  204.  191.
 7 MNG   Mongolia  1995  186   191.
 8 MNG   Mongolia  1999  182   191.
 9 EGY   Egypt     1992  174   174 
10 MNG   Mongolia  1996  174.  191.
# … with 31 more rows
Code
print(tail(before_2000, 10))
# A tibble: 10 × 5
# Groups:   country [3]
   iso   country               year   mmr  Mean
   <chr> <chr>                <int> <dbl> <dbl>
 1 OMN   Oman                  1994  24.4  18.0
 2 SAU   Saudi Arabia          1997  23    23  
 3 ARE   United Arab Emirates  1995  22.4  27.5
 4 OMN   Oman                  1995  22    18.0
 5 OMN   Oman                  1996  21    18.0
 6 OMN   Oman                  1998  18.5  18.0
 7 OMN   Oman                  1992  15.9  18.0
 8 OMN   Oman                  1999  13.7  18.0
 9 OMN   Oman                  1997  13.3  18.0
10 OMN   Oman                  1993   6    18.0
Code
after_2000 <- mmr %>% 
  filter(year >= 2000) %>% 
  group_by(country) %>% 
  mutate(Mean = mean(mmr,na.rm = TRUE)) %>% 
  arrange(desc(mmr))
after_2000
# A tibble: 134 × 5
# Groups:   country [28]
   iso   country      year   mmr  Mean
   <chr> <chr>       <int> <dbl> <dbl>
 1 SSD   South Sudan  2005 2037  2037 
 2 AFG   Afghanistan  2000 1900  1900 
 3 CMR   Cameroon     2010  652   652 
 4 SDN   Sudan        2005  638   427.
 5 IND   India        2000  327   240.
 6 BGD   Bangladesh   2000  322.  258.
 7 MMR   Myanmar      2004  316.  316.
 8 IND   India        2002  301   240.
 9 IND   India        2005  254   240.
10 NPL   Nepal        2008  229   229 
# … with 124 more rows
Code
print(tail(after_2000, 10))
# A tibble: 10 × 5
# Groups:   country [3]
   iso   country               year   mmr   Mean
   <chr> <chr>                <int> <dbl>  <dbl>
 1 TUR   Turkey                2010  16.4 17.2  
 2 OMN   Oman                  2000  16.1 22.3  
 3 TUR   Turkey                2013  15.9 17.2  
 4 TUR   Turkey                2011  15.5 17.2  
 5 OMN   Oman                  2005  15.4 22.3  
 6 TUR   Turkey                2012  15.4 17.2  
 7 TUR   Turkey                2014  15.2 17.2  
 8 ARE   United Arab Emirates  2008   1.4  0.467
 9 ARE   United Arab Emirates  2005   0    0.467
10 ARE   United Arab Emirates  2000   0    0.467
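
The chunks above compute each country's mean MMR within each period and sort the rows by MMR, but they stop before the ranking step. The sketch below shows one way the per-period ranks and their average could be computed, assuming dplyr 1.0.0 or later. `mmr_small` is a hypothetical 16-row subset copied from the printed data, and the `period` labels are made up for illustration.

```r
library(dplyr)

# Subset of the printed data: two observations per country in each period
mmr_small <- data.frame(
  country = rep(c("Oman", "China", "Thailand", "Mongolia"), each = 4),
  year    = c(1993, 1997, 2000, 2005,
              1995, 1999, 2000, 2005,
              1997, 1998, 2005, 2006,
              1995, 1999, 2013, 2014),
  mmr     = c(6.0, 13.3, 16.1, 15.4,
              61.9, 58.7, 53.0, 47.7,
              36.5, 36.4, 37.4, 41.6,
              186.0, 182.0, 42.6, 30.2)
)

mean_rank_by_period <- mmr_small %>%
  mutate(period = if_else(year < 2000, "period 1", "period 2")) %>%
  group_by(period, country) %>%
  summarise(mean_mmr = mean(mmr, na.rm = TRUE), .groups = "drop") %>%  # mean per country and period
  group_by(period) %>%
  mutate(rank = min_rank(mean_mmr)) %>%   # rank 1 = lowest mean MMR in that period
  group_by(country) %>%
  summarise(mean_rank = mean(rank), .groups = "drop") %>%  # average rank over both periods
  slice_min(mean_rank, n = 10)            # up to 10 countries with the smallest mean rank

mean_rank_by_period
```

In this subset Mongolia moves from last place in period 1 to second place in period 2, which gives it a mean rank of 3. Countries observed in only one period keep the single rank they have, which may or may not be the intended treatment.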
Source Code
---
title: "HW2"
author: "Lai Wei"
description: "Gain experience working with external data, dplyr, and the pipe operator."
date: "11/17/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - Lai Wei

---

In this homework, I use mmr_2015.csv, a data set containing a subset of the (real) data used to generate the United Nations maternal mortality estimates published in 2015.

```{r}
#| label: setup
#| warning: false

library(tidyverse)
library(babynames)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
```

## Read Data

Background for  mmr_2015.csv: 
The maternal mortality ratio (MMR) is defined as the number of maternal deaths per 100,000 live births. The UN maternal mortality estimation group produces estimates of the MMR for all countries in the world.

```{r}
mmr <- read.csv("_data/mmr_2015.csv")
mmr
```
Variables in the data set mmr_2015.csv are as follows:

-   iso = three-letter ISO country code
-   country = country name
-   year = observation year
-   mmr = observed maternal mortality ratio, defined as (number of maternal deaths / number of live births) × 100,000

## Data Visualization

Construct a graph that shows the observed values of the MMR plotted against year (starting in 2000) for China and Viet Nam. Use the pipe operator so that the graph follows from a multi-line command that starts with “mmr %>%”. Use ggplot() to display the data.

```{r}
# Keep China and Viet Nam from 2000 onward, then plot MMR against year
mmr %>%
  filter(country %in% c("China", "Viet Nam"), year >= 2000) %>%
  ggplot(aes(x = year, y = mmr, color = country)) +
  geom_point()
```


## Babynames

The babynames package contains the names of male and female babies born in the US from 1880 to 2017. Here, babynames is filtered to keep only those rows with year > 1975, sex equal to male, and either prop > 0.025 or n > 50000.

```{r}
babynames %>% 
  filter(year > 1975, sex == "M",prop > 0.025|n > 50000) %>% 
  ggplot(aes(x = year, y = prop))+
  geom_point(aes(group = name,color = name), size = 2)+
  geom_line(aes(group = name, color = name))+
  expand_limits(y = 0)
```

## Tidy Table

Construct and print a tibble that shows the countries sorted by their average observed MMR (rounded to zero digits), with the country with the highest average MMR listed first.

```{r}
data1 <- mmr %>%
  group_by(country) %>%
  summarise(ave = round(mean(mmr), 0)) %>%  # average MMR per country, rounded
  arrange(desc(ave))                        # highest average first
data1
```

## Continuing with the mmr data set

Part a: For each year

- first calculate the mean observed value for each country
- then rank countries by increasing MMR for each year.

Calculate the mean ranking across all years, extract the mean ranking for 10 countries with the lowest ranking across all years, and print the resulting table. 

```{r}
data2<-
  mmr %>% 
  group_by(year) %>% 
  mutate(Mean = mean(mmr,na.rm = TRUE)) %>% 
  arrange(desc(mmr))
data2
arrange(data2, desc(Mean))
lowest10 <- print(tail(data2,10))
```

Part b: With rankings calculated separately for two periods, with period 1 referring to years < 2000 and period 2 referring to years >= 2000. 

For each period

- first calculate the mean observed value for each country 
- then rank countries by increasing MMR for each period. 

Calculate the mean ranking across all periods, extract the 10 countries with the lowest ranking across all periods, and print the table.

```{r}
before_2000<-mmr %>% 
  filter(year < 2000) %>% 
  group_by(country) %>% 
  mutate(Mean = mean(mmr,na.rm = TRUE)) %>% 
  arrange(desc(mmr))
before_2000
print(tail(before_2000, 10))

after_2000 <- mmr %>% 
  filter(year >= 2000) %>% 
  group_by(country) %>% 
  mutate(Mean = mean(mmr,na.rm = TRUE)) %>% 
  arrange(desc(mmr))
after_2000
print(tail(after_2000, 10))
```